Add `StructArray` and `RunArray` benchmark tests to `with_hashes` by notashes · Pull Request #20182 · apache/datafusion

notashes · 2026-02-06T08:55:20Z

Which issue does this PR close?

Closes #20181 (Add StructArray and RunArray benchmarks to with_hashes suite in datafusion-common)

Rationale for this change

Issue #20152 identifies optimization opportunities for RunArray and StructArray hashing, but the existing with_hashes benchmarks don't cover these array types.

What changes are included in this PR?

Added benchmarks to with_hashes.rs:

- StructArray: 4-column struct (bool, int32, int64, string)
- RunArray: Int32 run-encoded array
- Both include single/multiple columns and with/without nulls
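For readers unfamiliar with run-end encoding, here is a minimal std-only sketch of what an Int32 run-encoded column represents: a list of cumulative run ends plus one value per run. The `RunEncoded` type and its methods are hypothetical names used for illustration; this is not the arrow-rs `RunArray` API and not the code added in this PR.

```rust
/// Illustrative model of a run-end encoded Int32 column (hypothetical type,
/// not arrow-rs). `run_ends[i]` is the exclusive logical end index of run `i`;
/// `values[i]` is the value repeated across that run (`None` = a null run).
struct RunEncoded {
    run_ends: Vec<i32>,
    values: Vec<Option<i32>>,
}

impl RunEncoded {
    /// Build from (run_length, value) pairs by accumulating the run lengths.
    fn from_runs(runs: &[(i32, Option<i32>)]) -> Self {
        let mut run_ends = Vec::with_capacity(runs.len());
        let mut values = Vec::with_capacity(runs.len());
        let mut end = 0;
        for &(len, value) in runs {
            assert!(len > 0, "run lengths must be positive");
            end += len;
            run_ends.push(end);
            values.push(value);
        }
        Self { run_ends, values }
    }

    /// Logical length is the last run end (0 for an empty column).
    fn len(&self) -> usize {
        self.run_ends.last().copied().unwrap_or(0) as usize
    }

    /// Value at a logical index: binary-search for the first run whose
    /// end is strictly greater than `index`.
    fn get(&self, index: usize) -> Option<i32> {
        assert!(index < self.len(), "index out of bounds");
        let physical = self
            .run_ends
            .partition_point(|&end| (end as usize) <= index);
        self.values[physical]
    }

    /// Number of logical nulls: a null run contributes its whole length.
    fn logical_null_count(&self) -> usize {
        let mut start = 0;
        let mut nulls = 0;
        for (&end, value) in self.run_ends.iter().zip(&self.values) {
            if value.is_none() {
                nulls += (end - start) as usize;
            }
            start = end;
        }
        nulls
    }
}
```

Building a column from the runs `(3, Some(7))`, `(2, None)`, `(4, Some(9))` yields run ends `[3, 5, 9]`, a logical length of 9, and two logical nulls at indices 3 and 4. The point of the sketch is that a hash over such a column can be computed once per run rather than once per element, which is why this layout is interesting to benchmark separately.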

Are these changes tested?

No additional tests added, but the benchmarks both compile and run.

A sample run:

❯ cargo bench --features=parquet --bench with_hashes -- array
   Compiling datafusion-common v52.1.0 (/Users/notashes/dev/datafusion/datafusion/common)
    Finished `bench` profile [optimized] target(s) in 34.49s
     Running benches/with_hashes.rs (target/release/deps/with_hashes-2f180744d22084f3)
Gnuplot not found, using plotters backend
struct_array: single, no nulls
                        time:   [38.389 µs 38.437 µs 38.485 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

struct_array: single, nulls
                        time:   [46.108 µs 46.197 µs 46.291 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

struct_array: multiple, no nulls
                        time:   [114.64 µs 114.79 µs 114.93 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild

struct_array: multiple, nulls
                        time:   [138.29 µs 138.62 µs 139.07 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

run_array_int32: single, no nulls
                        time:   [1.8777 µs 1.9098 µs 1.9457 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

run_array_int32: single, nulls
                        time:   [2.0110 µs 2.0417 µs 2.0751 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

run_array_int32: multiple, no nulls
                        time:   [5.0511 µs 5.0603 µs 5.0693 µs]
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

run_array_int32: multiple, nulls
                        time:   [5.6052 µs 5.6201 µs 5.6353 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Are there any user-facing changes?


Jefffrey · 2026-02-08T03:05:37Z

datafusion/common/benches/with_hashes.rs

-            do_hash_test(b, &arrays);
-        });
+
+        // Union arrays can't have null bitmasks


Mentioning union array when we don't implement that here?

I copied that verbatim from the other PR 😅 (to avoid merge conflicts in the future?), but I'm getting the sense that it's the wrong approach here!

Jefffrey · 2026-02-08T03:07:03Z

datafusion/common/benches/with_hashes.rs

+                    .clone()
+                    .into_data()
+                    .into_builder()
+                    .nulls(Some(create_null_mask(values.len())))


Something to think about is how null density acts differently here for run arrays, since we'd apply null on entire runs 🤔

I was thinking about it for a while. It should probably come out in the same 3% zone, though the variance could be a bit high.

I've set the run_length to be within 1..50.
Let's say we have ~300 runs on average, each carrying ~25 elements. 3% of the runs is roughly 10 runs, which translates to about 10 * 25 = 250 null elements. But yes, that is probably our ideal scenario.

Let me know what you think. I'll try to do some testing regarding this.
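The estimate above can be checked with a small std-only simulation. This is a sketch under stated assumptions, not the benchmark's code: it assumes each run is nulled independently with the same probability, uses run lengths in `1..50` as mentioned in the comment, and takes a total length of 7500 from the "~300 runs of ~25 elements" figure. `XorShift64` and `simulate` are hypothetical helpers, and how the PR's `create_null_mask` actually picks nulls is not shown here.

```rust
/// Tiny deterministic xorshift64 generator so the sketch needs no crates.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Integer in [lo, hi); modulo bias is acceptable for a sketch.
    fn range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next_u64() % (hi - lo)
    }

    /// True with probability roughly `p`.
    fn chance(&mut self, p: f64) -> bool {
        let unit = (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
        unit < p
    }
}

/// Simulate a run-encoded column of `total_len` logical elements where each
/// run has a length in 1..50 and is null with probability `null_run_prob`.
/// Returns (number of runs, number of null runs, logical null fraction).
fn simulate(total_len: u64, null_run_prob: f64, seed: u64) -> (u64, u64, f64) {
    let mut rng = XorShift64(seed);
    let (mut covered, mut runs, mut null_runs, mut null_elems) = (0u64, 0u64, 0u64, 0u64);
    while covered < total_len {
        // Truncate the final run so the logical length is exactly `total_len`.
        let len = rng.range(1, 50).min(total_len - covered);
        runs += 1;
        if rng.chance(null_run_prob) {
            null_runs += 1;
            null_elems += len; // a null run nulls every element it covers
        }
        covered += len;
    }
    (runs, null_runs, null_elems as f64 / total_len as f64)
}
```

Calling `simulate(7500, 0.03, seed)` with several different seeds gives a feel for the spread. Under the independence assumption the expected logical null fraction equals the per-run probability, since every element belongs to exactly one run, but individual samples can deviate noticeably because nulls arrive in chunks of up to 49 elements.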

Jefffrey · 2026-02-08T03:07:31Z

datafusion/common/benches/with_hashes.rs

+    )
+}
+
+fn string_array(array_len: usize) -> ArrayRef {


Do we need this if we already have StringPool above?

Done! I don't think a separate one offers any benefit. Both give me close to a 10% speedup locally (with the struct_array optimization).

bench: adds benchmark tests for StructArray and RunArray

2adccac

github-actions bot added the `common` label (Related to common crate) Feb 6, 2026

Merge branch 'main' into with_hashes

6dac5a6

notashes mentioned this pull request Feb 6, 2026

perf: various optimizations to eliminate branch misprediction in hash_utils #20168

Open

Jefffrey reviewed Feb 7, 2026


notashes and others added 2 commits February 7, 2026 22:51

Merge branch 'main' into with_hashes

caa31e6

fix: null_array with null creation overhaul

b2aca22

Jefffrey reviewed Feb 8, 2026


fix: use existing stringpool for struct_array

0918040

notashes wants to merge 5 commits into apache:main from notashes:with_hashes

2 participants
